Annotating Complex Linguistic Features in Bilingual Corpora: The Case of MULTINOT
Author
Abstract
In spite of the current need in the computational community for digital corpora in different languages with complex linguistic annotations going beyond morphosyntactic features, there is not much work within the Digital Humanities community dedicated to this task. In this paper I describe recent work on the development of a bilingual (English-Spanish) corpus consisting of original comparable and parallel texts from a variety of genres and annotated with complex linguistic features such as modality and evidentiality, metadiscourse markers, and thematisation, as carried out within the framework of the MULTINOT project (Lavid et al. 2015).
Similar resources
Using sign language corpora as bilingual corpora for data mining: Contrastive linguistics and computer-assisted..
More and more sign languages are now documented by large-scale digital corpora. But exploiting sign language (SL) corpus data remains subject to the time-consuming and expensive manual task of annotation. In this paper, we present ongoing research that aims at testing a new approach to better mine SL data. It relies on the methodology of corpus-based contrastive linguistics, exploit...
EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora
In this paper, we present an unsupervised hybrid model which combines statistical, lexical, linguistic, contextual, and temporal features in a generic EM-based framework to harvest bilingual terminology from comparable corpora through a comparable document alignment constraint. The model is configurable for any language and is extensible for additional features. Overall, it produces considerabl...
Building bilingual terminologies from comparable corpora: the TTC TermSuite
In this paper, we exploit domain-specific comparable corpora to build bilingual terminologies. We present the monolingual term extraction and the bilingual alignment that will allow us to identify and translate highly specialised terminology. We stress the huge importance of taking into account both simple and complex terms in a multilingual environment. Such linguistic diversity implies to combi...
Annotating Syllable Corpora with Linguistic Data Categories in XML
The usefulness of high-quality annotated corpora as a development aid in computational linguistic applications is now well understood. Therefore it is necessary to have systematic, easily understandable and effective means for annotating corpora at many levels of linguistic description. This paper presents a three-step methodology for annotating speech corpora using linguistic data catego...
Named Entities Translation Based On Comparable Corpora
In this paper we present a system for translating named entities from Basque to Spanish based on comparable corpora. For that purpose we have tried two approaches: one based on Basque linguistic features, and a language-independent tool. For both tools we have used Basque-Spanish comparable corpora, a bilingual dictionary and the web as resources.